This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH.
Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach.
Please submit both Rpubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.
We will describe the data wrangling steps in brief before doing any exploratory data analysis and model building. After loading the raw data files, our transformations and changes are as follows:
id to allow unique identification of all raw observations.PH.PH value into a 80/20 train and validate data set. We recognize that a test data set exists but this is for external assessment of the project prediction accuracy.BrandCode categorical value with a dummy BrandCode EMFR missing value with the training set median.Along the way, we will assess the numerical impact of such changes to the dataset.
## [1] 2571 33
Let’s evaluate the data characteristics of the raw file after loading. We use the skimr library to get statistics quickly for the entire training data set. The table below shows the numeric variables sorted by their completion rate in descending order. So the most problematic predictors are at the top of the table.
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| MFR | 212 | 0.918 | 704.049 | 73.898 | 31.400 | 706.300 | 724.000 | 731.000 | 868.600 | ▁▁▁▂▇ |
| Filler Speed | 57 | 0.978 | 3687.199 | 770.820 | 998.000 | 3888.000 | 3982.000 | 3998.000 | 4030.000 | ▁▁▁▁▇ |
| PC Volume | 39 | 0.985 | 0.277 | 0.061 | 0.079 | 0.239 | 0.271 | 0.312 | 0.478 | ▁▃▇▂▁ |
| PSC CO2 | 39 | 0.985 | 0.056 | 0.043 | 0.000 | 0.020 | 0.040 | 0.080 | 0.240 | ▇▅▂▁▁ |
| Fill Ounces | 38 | 0.985 | 23.975 | 0.088 | 23.633 | 23.920 | 23.973 | 24.027 | 24.320 | ▁▂▇▂▁ |
| PSC | 33 | 0.987 | 0.085 | 0.049 | 0.002 | 0.048 | 0.076 | 0.112 | 0.270 | ▆▇▃▁▁ |
| Carb Pressure1 | 32 | 0.988 | 122.586 | 4.743 | 105.600 | 119.000 | 123.200 | 125.400 | 140.200 | ▁▃▇▂▁ |
| Hyd Pressure4 | 30 | 0.988 | 96.289 | 13.123 | 52.000 | 86.000 | 96.000 | 102.000 | 142.000 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.989 | 68.190 | 3.538 | 57.000 | 65.600 | 68.200 | 70.600 | 79.400 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.990 | 141.095 | 4.037 | 128.600 | 138.400 | 140.800 | 143.800 | 154.000 | ▁▅▇▃▁ |
| PSC Fill | 23 | 0.991 | 0.195 | 0.118 | 0.000 | 0.100 | 0.180 | 0.260 | 0.620 | ▆▇▃▁▁ |
| Fill Pressure | 22 | 0.991 | 47.922 | 3.178 | 34.600 | 46.000 | 46.400 | 50.000 | 60.400 | ▁▁▇▂▁ |
| Filler Level | 20 | 0.992 | 109.252 | 15.698 | 55.800 | 98.300 | 118.400 | 120.000 | 161.200 | ▁▃▅▇▁ |
| Hyd Pressure2 | 15 | 0.994 | 20.961 | 16.386 | 0.000 | 0.000 | 28.600 | 34.600 | 59.400 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.994 | 20.458 | 15.976 | -1.200 | 0.000 | 27.600 | 33.400 | 50.000 | ▇▁▃▇▁ |
| Temperature | 14 | 0.995 | 65.968 | 1.383 | 63.600 | 65.200 | 65.600 | 66.400 | 76.200 | ▇▃▁▁▁ |
| Oxygen Filler | 12 | 0.995 | 0.047 | 0.047 | 0.002 | 0.022 | 0.033 | 0.060 | 0.400 | ▇▁▁▁▁ |
| Pressure Setpoint | 12 | 0.995 | 47.615 | 2.039 | 44.000 | 46.000 | 46.000 | 50.000 | 52.000 | ▁▇▁▆▁ |
| Hyd Pressure1 | 11 | 0.996 | 12.438 | 12.433 | -0.800 | 0.000 | 11.400 | 20.200 | 58.000 | ▇▅▂▁▁ |
| Carb Volume | 10 | 0.996 | 5.370 | 0.106 | 5.040 | 5.293 | 5.347 | 5.453 | 5.700 | ▁▆▇▅▁ |
| Carb Rel | 10 | 0.996 | 5.437 | 0.129 | 4.960 | 5.340 | 5.400 | 5.540 | 6.060 | ▁▇▇▂▁ |
| Alch Rel | 9 | 0.996 | 6.897 | 0.505 | 5.280 | 6.540 | 6.560 | 7.240 | 8.620 | ▁▇▂▃▁ |
| Usage cont | 5 | 0.998 | 20.993 | 2.978 | 12.080 | 18.360 | 21.790 | 23.755 | 25.900 | ▁▃▅▃▇ |
| PH | 4 | 0.998 | 8.546 | 0.173 | 7.880 | 8.440 | 8.540 | 8.680 | 9.360 | ▁▅▇▂▁ |
| Mnf Flow | 2 | 0.999 | 24.569 | 119.481 | -100.200 | -100.000 | 65.200 | 140.800 | 229.400 | ▇▁▁▇▂ |
| Carb Flow | 2 | 0.999 | 2468.354 | 1073.696 | 26.000 | 1144.000 | 3028.000 | 3186.000 | 5104.000 | ▂▅▆▇▁ |
| Bowl Setpoint | 2 | 0.999 | 109.327 | 15.303 | 70.000 | 100.000 | 120.000 | 120.000 | 140.000 | ▁▂▃▇▁ |
| Density | 1 | 1.000 | 1.174 | 0.378 | 0.240 | 0.900 | 0.980 | 1.620 | 1.920 | ▁▅▇▂▆ |
| Balling | 1 | 1.000 | 2.198 | 0.931 | -0.170 | 1.496 | 1.648 | 3.292 | 4.012 | ▁▇▇▁▇ |
| Balling Lvl | 1 | 1.000 | 2.050 | 0.870 | 0.000 | 1.380 | 1.480 | 3.140 | 3.660 | ▁▇▂▁▆ |
| Pressure Vacuum | 0 | 1.000 | -5.216 | 0.570 | -6.600 | -5.600 | -5.400 | -5.000 | -3.600 | ▂▇▆▂▁ |
| Air Pressurer | 0 | 1.000 | 142.834 | 1.212 | 140.800 | 142.200 | 142.600 | 143.000 | 148.200 | ▅▇▁▁▁ |
Next, we display the one character variable Brand below and its skimr output below.
Character Variables in Beverage Data
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.953 | 1 | 1 | 0 | 4 | 0 |
After loading the raw Excel file, we find the training dataset has 2571 observations with 32 predictors and 1 response variable PH.
General observations about the skimr results:
MFR is missing the most data: about 8% (212) of its observations are missing. The percentage of missing observations is too high for automated imputation in our opinion.Filler Speed is missing the next most data: about 2.2% (57) of its observations are missing. The percentage is low enough for us to consider imputation.MFR, Brand, all of the other predictors have data completeness rates over 97.8%. Therefore, except for Brand and MFR, an automated data imputation policy can be considered for those columns.PH has 4 missing observations. We will decide to drop those 4 observations as we do not wish to impute the response.Our first transformation below is to add a unique identifier id based on row number to obtain a primary key.
Then we drop the 4 observations which are missing PH values, rearrange the response variable PH and BrandCode for our convenience. Lastly, we also eliminate spaces from the predictor column names to avoid using quote in our code.
raw_StudentData %>%
rename_with(~ str_replace(., " ", ""), everything() ) %>%
mutate( id = row_number() ) %>%
relocate(id) %>%
relocate(PH, .after = id) %>%
relocate(`BrandCode`, .after = `BallingLvl` ) %>%
filter( !is.na(PH)) %>% as_tibble() -> sdata_v1
dim(sdata_v1)## [1] 2567 34
Is there any observations with a high number of missing observations?
# include any rows where any column is NA. Always include id column
sdata_v1 %>% dplyr::filter_all( any_vars(is.na(.))) -> x1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
x1 %>% select_if( ~ any(is.na(.) ) ) -> x2
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
x2 %>% mutate( id = x1$id ) %>% relocate(id)-> missing_train_rows
dim(missing_train_rows)## [1] 529 28
We find there are 529 observations with NA observations. So 20.6% have NA values. We conclude that the NA values are sufficiently well distributed that removal of all observations with NA values is impractical.
In the marginal table below, we investigate if certain observations have a high incidence of NA values amongst their predictors. The answer shows that the incident of multiple
Number of Rows with NA
| num_na | count |
|---|---|
| 1 | 345 |
| 2 | 119 |
| 3 | 45 |
| 4 | 13 |
| 5 | 3 |
| 6 | 2 |
| 7 | 1 |
| 8 | 1 |
## [1] "There are 812 cells with NA values out of 82144 equivalent to: 0.99%"
Overall, only 1 percent of cells have missing values.
The concentration of missing values seems to decline at a geometric ratio suggesting independence among the occurence of missing values.
## [1] 267 33
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| MFR | 31 | 0.884 | 697.801 | 96.400 | 15.600 | 707.000 | 724.600 | 731.450 | 784.800 | ▁▁▁▁▇ |
| Filler Speed | 10 | 0.963 | 3581.385 | 911.187 | 1006.000 | 3812.000 | 3978.000 | 3996.000 | 4020.000 | ▁▁▁▁▇ |
| Fill Ounces | 6 | 0.978 | 23.970 | 0.076 | 23.747 | 23.920 | 23.967 | 24.013 | 24.200 | ▁▅▇▃▁ |
| PSC | 5 | 0.981 | 0.085 | 0.053 | 0.004 | 0.044 | 0.076 | 0.112 | 0.246 | ▆▇▃▂▁ |
| PSC CO2 | 5 | 0.981 | 0.051 | 0.038 | 0.000 | 0.020 | 0.040 | 0.060 | 0.240 | ▇▃▂▁▁ |
| PC Volume | 4 | 0.985 | 0.278 | 0.063 | 0.099 | 0.233 | 0.275 | 0.322 | 0.464 | ▁▆▇▅▁ |
| Carb Pressure1 | 4 | 0.985 | 123.040 | 4.420 | 113.000 | 120.200 | 123.400 | 125.500 | 136.000 | ▃▃▇▂▁ |
| Hyd Pressure4 | 4 | 0.985 | 97.840 | 13.921 | 68.000 | 90.000 | 98.000 | 104.000 | 140.000 | ▅▆▇▂▁ |
| PSC Fill | 3 | 0.989 | 0.190 | 0.113 | 0.020 | 0.100 | 0.180 | 0.260 | 0.620 | ▇▇▃▁▁ |
| Oxygen Filler | 3 | 0.989 | 0.047 | 0.050 | 0.002 | 0.020 | 0.034 | 0.054 | 0.398 | ▇▁▁▁▁ |
| Alch Rel | 3 | 0.989 | 6.907 | 0.498 | 6.400 | 6.540 | 6.580 | 7.180 | 7.820 | ▇▁▂▁▃ |
| Fill Pressure | 2 | 0.993 | 48.141 | 3.440 | 37.800 | 46.000 | 47.800 | 50.200 | 60.200 | ▁▇▇▂▁ |
| Filler Level | 2 | 0.993 | 110.292 | 15.502 | 69.200 | 100.600 | 118.600 | 120.200 | 153.200 | ▂▃▇▇▁ |
| Temperature | 2 | 0.993 | 66.227 | 1.692 | 63.800 | 65.400 | 65.800 | 66.600 | 75.400 | ▇▅▁▁▁ |
| Usage cont | 2 | 0.993 | 20.897 | 2.999 | 12.900 | 18.120 | 21.440 | 23.740 | 24.600 | ▁▃▃▃▇ |
| Pressure Setpoint | 2 | 0.993 | 47.733 | 2.057 | 44.000 | 46.000 | 46.000 | 50.000 | 52.000 | ▁▇▁▆▁ |
| Carb Rel | 2 | 0.993 | 5.440 | 0.127 | 5.180 | 5.340 | 5.400 | 5.560 | 5.740 | ▂▇▂▃▂ |
| Carb Volume | 1 | 0.996 | 5.369 | 0.111 | 5.147 | 5.287 | 5.340 | 5.465 | 5.667 | ▂▇▃▅▁ |
| Carb Temp | 1 | 0.996 | 141.226 | 4.296 | 130.000 | 138.400 | 140.800 | 143.800 | 154.000 | ▁▆▇▃▁ |
| Hyd Pressure2 | 1 | 0.996 | 20.113 | 17.214 | -50.000 | 0.000 | 26.800 | 34.800 | 61.400 | ▁▁▆▇▁ |
| Hyd Pressure3 | 1 | 0.996 | 19.609 | 16.557 | -50.000 | 0.000 | 27.700 | 33.000 | 49.200 | ▁▁▆▃▇ |
| Density | 1 | 0.996 | 1.177 | 0.382 | 0.060 | 0.920 | 0.980 | 1.600 | 1.840 | ▁▁▇▁▅ |
| Balling | 1 | 0.996 | 2.203 | 0.925 | 0.902 | 1.498 | 1.648 | 3.242 | 3.788 | ▅▇▁▂▅ |
| Pressure Vacuum | 1 | 0.996 | -5.174 | 0.582 | -6.400 | -5.600 | -5.200 | -4.800 | -3.600 | ▁▇▆▃▁ |
| Bowl Setpoint | 1 | 0.996 | 109.624 | 15.017 | 70.000 | 100.000 | 120.000 | 120.000 | 130.000 | ▁▂▁▃▇ |
| Air Pressurer | 1 | 0.996 | 142.831 | 1.231 | 141.200 | 142.200 | 142.600 | 142.800 | 147.200 | ▅▇▁▁▁ |
| Carb Pressure | 0 | 1.000 | 68.255 | 3.857 | 60.200 | 65.300 | 68.000 | 70.600 | 77.600 | ▃▆▇▃▂ |
| Mnf Flow | 0 | 1.000 | 21.034 | 117.756 | -100.200 | -100.000 | 0.200 | 141.300 | 220.400 | ▇▁▁▆▂ |
| Hyd Pressure1 | 0 | 1.000 | 12.010 | 13.531 | -50.000 | 0.000 | 10.400 | 20.400 | 50.000 | ▁▁▇▆▂ |
| Carb Flow | 0 | 1.000 | 2408.644 | 1161.364 | 0.000 | 1083.000 | 3038.000 | 3215.000 | 3858.000 | ▂▃▁▆▇ |
| Balling Lvl | 0 | 1.000 | 2.051 | 0.883 | 0.000 | 1.380 | 1.480 | 3.080 | 3.420 | ▁▃▇▁▇ |
Character Variables in Beverage Data
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 8 | 0.97 | 1 | 1 | 0 | 4 | 0 |
## [1] 267 34
# include any rows where any column is NA. Always include id column
#
# NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
# for all observations in the test set.
edata_v1 %>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
ex1 %>% select_if( ~ any(is.na(.) ) ) -> ex2
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
ex2 %>% mutate( id = ex1$id ) %>% relocate(id)-> missing_test_rows
dim(missing_test_rows)## [1] 67 28
Number of Rows with NA
| num_na | count |
|---|---|
| 1 | 50 |
| 2 | 11 |
| 3 | 3 |
| 4 | 1 |
| 8 | 1 |
| 14 | 1 |
## [1] "There are 107 cells with NA values (excluding PH) out of 8544 equivalent to: 1.25%"
We conclude that the distribution of missing values in the test set is:
The scatterplots of the predictor variables against the dependent variable PH indicate no clear linear relationship with any predictor variable. The scatterplots indicate some variable measurements are not continuous and instead fall on specific values on the x-axis.
The density plots of each predictor variable show many predictor variables with non-normal distributions.
The Shapiro-Wilk test of normality is performed on all the numeric predictor variables and the resulting list confirms no p-value above 0.05 and thus no predictor variable follows a normal distribution.
## [1] 0
The distribution of the Brand Code, the lone non-numeric predictor variable, shows a high count of brand B, comprising almost half of the samples. Also note, some values are missing for the Brand Code variable.
The boxplot of PH values by Brand Code certainly captures overlap of the middle two quartiles, which seems reasonable given the PH range is rather narrow. Of note, the median values of the 4 labeled brands and the unlabeled group do appear distinct besides Brand A and the unlabeled.
To find significant variables for the models, the VSURF algorithm is applied to the provided dataset. The algorithm identifies 14 variables as significant to the predictions.
Next we partition the student evaluation data set by training and test sets. We will retain the terminology of test set to refer to the validation data set used for model performance assessment outside of the cross validation process.
set.seed(19372)
train_indices = createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )
sdata_test = sdata_v1[-train_indices , ]
sdata_train = sdata_v1[ train_indices , ]
sdataY_test = sdata_v1$PH[-train_indices ]
sdataY_train = sdata_v1$PH[ train_indices ]We conclude there are no near zero variance predictors or constant predictors to be dropped from the training data set.
No Near Zero Variance Predictors in Training
| freqRatio | percentUnique | zeroVar | nzv | |
|---|---|---|---|---|
| id | 1.00 | 100.00 | FALSE | FALSE |
| PH | 1.05 | 2.48 | FALSE | FALSE |
| CarbVolume | 1.03 | 4.72 | FALSE | FALSE |
| FillOunces | 1.18 | 4.38 | FALSE | FALSE |
| PCVolume | 1.19 | 21.22 | FALSE | FALSE |
| CarbPressure | 1.10 | 5.06 | FALSE | FALSE |
| CarbTemp | 1.04 | 5.99 | FALSE | FALSE |
| PSC | 1.14 | 6.23 | FALSE | FALSE |
| PSCFill | 1.11 | 1.56 | FALSE | FALSE |
| PSCCO2 | 1.08 | 0.63 | FALSE | FALSE |
| MnfFlow | 1.08 | 21.75 | FALSE | FALSE |
| CarbPressure1 | 1.02 | 6.37 | FALSE | FALSE |
| FillPressure | 1.78 | 5.06 | FALSE | FALSE |
| HydPressure1 | 31.86 | 11.58 | FALSE | FALSE |
| HydPressure2 | 7.63 | 9.39 | FALSE | FALSE |
| HydPressure3 | 12.71 | 8.91 | FALSE | FALSE |
| HydPressure4 | 1.02 | 1.90 | FALSE | FALSE |
| FillerLevel | 1.05 | 12.90 | FALSE | FALSE |
| FillerSpeed | 1.00 | 10.41 | FALSE | FALSE |
| Temperature | 1.05 | 2.63 | FALSE | FALSE |
| Usagecont | 1.12 | 22.43 | FALSE | FALSE |
| CarbFlow | 1.33 | 23.21 | FALSE | FALSE |
| Density | 1.19 | 3.65 | FALSE | FALSE |
| MFR | 1.04 | 25.94 | FALSE | FALSE |
| Balling | 1.24 | 9.34 | FALSE | FALSE |
| PressureVacuum | 1.43 | 0.78 | FALSE | FALSE |
| OxygenFiller | 1.32 | 15.67 | FALSE | FALSE |
| BowlSetpoint | 2.80 | 0.49 | FALSE | FALSE |
| PressureSetpoint | 1.33 | 0.39 | FALSE | FALSE |
| AirPressurer | 1.03 | 1.41 | FALSE | FALSE |
| AlchRel | 1.10 | 2.53 | FALSE | FALSE |
| CarbRel | 1.02 | 1.95 | FALSE | FALSE |
| BallingLvl | 1.23 | 3.89 | FALSE | FALSE |
| BrandCode | 2.01 | 0.19 | FALSE | FALSE |
Next we identify and fix zeros and outliers in the predictors before imputing missing values.
The reason is that KNN algorithm may be affected by outliers. The table below uses skim to report the minimum, maximum, median, mean and standard deviation for each predictor. In addition, we construct Z-score metrics that tells us how many standard deviations is the minimum and maximum.
\[ \text{zscore_min}(X_i) = \frac{ min(X_i) - \mu(X_i) }{\sigma(X_i)}\] \[\text{zscore_max}(X_i) = \frac{ max(X_i) - \mu(X_i) }{\sigma(X_i )}\]
Predictors with Outliers by Z-Score
| skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.hist | zscore_min | zscore_max |
|---|---|---|---|---|---|---|---|
| MFR | 156 | 0.92 | 704.15 | 74.72 | ▁▁▁▂▇ | -9.00 | 2.20 |
| OxygenFiller | 7 | 1.00 | 0.05 | 0.04 | ▇▁▁▁▁ | -0.97 | 7.88 |
| Temperature | 8 | 1.00 | 65.95 | 1.36 | ▇▃▁▁▁ | -1.73 | 7.55 |
| CarbRel | 6 | 1.00 | 5.44 | 0.13 | ▁▇▇▂▁ | -3.70 | 4.84 |
| AirPressurer | 0 | 1.00 | 142.83 | 1.20 | ▇▇▁▁▁ | -1.52 | 4.48 |
| PSCCO2 | 29 | 0.99 | 0.06 | 0.04 | ▇▅▂▁▁ | -1.31 | 4.31 |
| FillPressure | 15 | 0.99 | 47.91 | 3.14 | ▁▁▇▅▁ | -4.23 | 3.72 |
Using scatterplots of each of the 7 identified predictors below, we see no obvious data errors exist in the distribution of the predictors with outlier values.
Isolated points exist but lie within a reasonsable proximity to neighboring points in the scatter plot. Therefore no data corrections for outliers will be applied to the training data.
The code below will transform the special case situations of data pre-processing identified earlier:
E.These rules are applied consistently to be the training, validation and test data sets prior to subsequent pre processing.
We apply the caret pre processing function to that training, validation and test data sets.
# Build the caret function to preprocess the Chemical data and impute missing values.
# There is a bug in caret which causes tibbles to be rejected. They need to be cast as data frames.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale", "knnImpute") )
# Becomes the source data for the model building
sdata_train_pp = predict( preProcFunc, as.data.frame(sdata_v2_train ) )
# Becomes the final version of test data for validation
sdata_test_pp = predict( preProcFunc, as.data.frame(sdata_v2_test) )
# Need to generate predictions based on test data without known response
edata_test_pp = predict( preProcFunc, as.data.frame(edata_v2) )
sdata_trainX_pp = sdata_train_pp %>% select( -PH)
sdata_testX_pp = sdata_test_pp %>% select(-PH)
edata_testX_pp = edata_test_pp %>% select( -PH)| Name | sdata_train_pp |
| Number of rows | 2055 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 33 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| BrandCode | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 1297.57 | 751.06 | 1.00 | 644.50 | 1296.00 | 1961.00 | 2568.00 | ▇▇▇▇▇ |
| PH | 0 | 1 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▁▁ |
| CarbVolume | 0 | 1 | 0.00 | 1.00 | -3.42 | -0.71 | -0.26 | 0.86 | 2.90 | ▁▃▇▅▁ |
| FillOunces | 0 | 1 | 0.00 | 0.99 | -3.88 | -0.62 | -0.01 | 0.60 | 3.91 | ▁▂▇▂▁ |
| PCVolume | 0 | 1 | 0.01 | 1.00 | -3.72 | -0.60 | -0.07 | 0.59 | 3.01 | ▁▂▇▃▁ |
| CarbPressure | 0 | 1 | 0.01 | 1.00 | -3.11 | -0.72 | 0.04 | 0.70 | 2.88 | ▁▃▇▅▁ |
| CarbTemp | 0 | 1 | 0.01 | 1.00 | -3.47 | -0.65 | 0.01 | 0.69 | 2.90 | ▁▂▇▅▁ |
| PSC | 0 | 1 | 0.00 | 0.99 | -2.75 | -0.65 | 0.01 | 0.66 | 2.83 | ▁▅▇▅▁ |
| PSCFill | 0 | 1 | 0.00 | 1.00 | -1.66 | -0.81 | -0.13 | 0.54 | 3.60 | ▆▇▃▁▁ |
| PSCCO2 | 0 | 1 | 0.00 | 0.99 | -1.31 | -0.84 | -0.37 | 0.33 | 4.31 | ▇▅▂▁▁ |
| MnfFlow | 0 | 1 | 0.00 | 1.00 | -1.05 | -1.05 | 0.50 | 0.97 | 1.58 | ▇▁▁▆▂ |
| CarbPressure1 | 0 | 1 | -0.01 | 1.00 | -3.63 | -0.77 | 0.13 | 0.60 | 3.77 | ▁▃▇▂▁ |
| FillPressure | 0 | 1 | 0.00 | 1.00 | -5.38 | -0.58 | -0.45 | 0.70 | 3.23 | ▁▁▇▆▁ |
| HydPressure1 | 0 | 1 | 0.00 | 1.00 | -1.07 | -1.01 | -0.08 | 0.65 | 3.66 | ▇▅▂▁▁ |
| HydPressure2 | 0 | 1 | 0.00 | 1.00 | -1.29 | -1.29 | 0.47 | 0.83 | 2.34 | ▇▂▇▅▁ |
| HydPressure3 | 0 | 1 | 0.00 | 1.00 | -1.37 | -1.29 | 0.44 | 0.80 | 1.84 | ▇▁▃▇▁ |
| HydPressure4 | 0 | 1 | 0.01 | 1.00 | -3.40 | -0.76 | 0.07 | 0.52 | 2.82 | ▁▅▇▇▁ |
| FillerLevel | 0 | 1 | 0.00 | 1.00 | -2.85 | -0.72 | 0.54 | 0.68 | 4.30 | ▁▅▇▁▁ |
| FillerSpeed | 0 | 1 | -0.05 | 1.05 | -3.44 | 0.20 | 0.41 | 0.44 | 0.51 | ▁▁▁▁▇ |
| Temperature | 0 | 1 | 0.00 | 1.00 | -1.91 | -0.57 | -0.25 | 0.38 | 6.50 | ▇▆▁▁▁ |
| Usagecont | 0 | 1 | 0.00 | 1.00 | -2.46 | -0.92 | 0.20 | 0.95 | 1.83 | ▁▇▅▇▆ |
| CarbFlow | 0 | 1 | 0.00 | 1.00 | -2.02 | -1.33 | 0.52 | 0.69 | 2.99 | ▃▁▇▁▁ |
| Density | 0 | 1 | 0.00 | 1.00 | -4.85 | -0.68 | -0.41 | 1.18 | 1.71 | ▁▁▁▇▅ |
| MFR | 0 | 1 | 0.00 | 1.00 | -6.70 | -0.02 | 0.29 | 0.41 | 3.35 | ▁▁▁▇▁ |
| Balling | 0 | 1 | 0.00 | 1.00 | -6.22 | -0.74 | -0.50 | 1.19 | 1.68 | ▁▁▁▇▅ |
| PressureVacuum | 0 | 1 | 0.00 | 1.00 | -2.42 | -0.67 | -0.31 | 0.39 | 2.84 | ▂▇▅▃▂ |
| OxygenFiller | 0 | 1 | 0.00 | 1.00 | -1.98 | -0.35 | 0.03 | 0.62 | 3.21 | ▃▇▆▂▁ |
| BowlSetpoint | 0 | 1 | 0.00 | 1.00 | -2.39 | -0.73 | 0.70 | 0.70 | 2.40 | ▁▃▃▇▁ |
| PressureSetpoint | 0 | 1 | 0.00 | 1.00 | -1.94 | -0.77 | -0.77 | 1.16 | 1.96 | ▁▇▁▆▁ |
| AirPressurer | 0 | 1 | 0.00 | 1.00 | -1.58 | -0.53 | -0.18 | 0.16 | 4.37 | ▅▇▁▁▁ |
| AlchRel | 0 | 1 | 0.00 | 1.00 | -5.09 | -0.72 | -0.62 | 0.79 | 2.76 | ▁▁▇▂▃ |
| CarbRel | 0 | 1 | 0.00 | 1.00 | -4.28 | -0.75 | -0.26 | 0.97 | 4.21 | ▁▂▇▅▁ |
| BallingLvl | 0 | 1 | 0.00 | 1.00 | -2.36 | -0.77 | -0.66 | 1.26 | 1.86 | ▁▇▂▁▆ |
| Name | sdata_test_pp |
| Number of rows | 512 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 33 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| BrandCode | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 1237.47 | 705.94 | 5.00 | 638.00 | 1248.50 | 1803.75 | 2571.00 | ▇▇▇▇▆ |
| PH | 0 | 1 | 8.55 | 0.17 | 7.98 | 8.44 | 8.54 | 8.68 | 8.94 | ▁▃▇▇▃ |
| CarbVolume | 0 | 1 | 0.01 | 1.01 | -3.12 | -0.71 | -0.19 | 0.80 | 2.52 | ▁▃▇▅▂ |
| FillOunces | 0 | 1 | 0.03 | 1.01 | -3.20 | -0.55 | -0.01 | 0.60 | 3.99 | ▁▅▇▂▁ |
| PCVolume | 0 | 1 | -0.01 | 0.93 | -3.47 | -0.55 | -0.04 | 0.58 | 2.79 | ▁▂▇▅▁ |
| CarbPressure | 0 | 1 | 0.00 | 0.99 | -3.54 | -0.72 | -0.02 | 0.71 | 2.74 | ▁▂▇▇▁ |
| CarbTemp | 0 | 1 | -0.02 | 1.00 | -3.47 | -0.65 | -0.09 | 0.64 | 2.90 | ▁▂▇▅▁ |
| PSC | 0 | 1 | -0.05 | 1.06 | -2.53 | -0.81 | -0.08 | 0.67 | 2.76 | ▂▇▇▅▂ |
| PSCFill | 0 | 1 | -0.02 | 0.99 | -1.66 | -0.81 | -0.13 | 0.54 | 3.43 | ▆▇▃▂▁ |
| PSCCO2 | 0 | 1 | 0.07 | 1.03 | -1.31 | -0.84 | -0.37 | 0.56 | 4.31 | ▇▅▂▁▁ |
| MnfFlow | 0 | 1 | -0.06 | 0.98 | -1.05 | -1.05 | -0.22 | 0.92 | 1.69 | ▇▁▁▇▁ |
| CarbPressure1 | 0 | 1 | -0.02 | 1.06 | -3.46 | -0.90 | 0.09 | 0.65 | 3.34 | ▁▅▇▅▁ |
| FillPressure | 0 | 1 | 0.02 | 1.04 | -3.82 | -0.58 | -0.45 | 0.70 | 3.41 | ▁▁▇▅▁ |
| HydPressure1 | 0 | 1 | -0.03 | 1.01 | -1.07 | -1.01 | -0.12 | 0.57 | 3.21 | ▇▅▂▁▁ |
| HydPressure2 | 0 | 1 | -0.04 | 1.00 | -1.29 | -1.29 | 0.42 | 0.81 | 2.08 | ▇▁▆▆▁ |
| HydPressure3 | 0 | 1 | -0.05 | 1.00 | -1.37 | -1.29 | 0.40 | 0.79 | 1.81 | ▇▁▃▇▁ |
| HydPressure4 | 0 | 1 | 0.07 | 0.99 | -2.16 | -0.58 | 0.23 | 0.52 | 2.82 | ▂▃▇▂▁ |
| FillerLevel | 0 | 1 | -0.07 | 1.05 | -2.48 | -1.22 | 0.57 | 0.69 | 3.08 | ▂▃▇▁▁ |
| FillerSpeed | 0 | 1 | -0.16 | 1.18 | -3.43 | 0.05 | 0.35 | 0.44 | 0.48 | ▁▁▁▁▇ |
| Temperature | 0 | 1 | 0.06 | 1.05 | -1.91 | -0.57 | -0.10 | 0.38 | 6.30 | ▆▇▁▁▁ |
| Usagecont | 0 | 1 | -0.05 | 1.01 | -2.54 | -1.04 | 0.19 | 0.94 | 1.78 | ▁▇▅▇▇ |
| CarbFlow | 0 | 1 | 0.08 | 1.00 | -2.02 | -1.25 | 0.54 | 0.74 | 1.43 | ▃▁▁▇▅ |
| Density | 0 | 1 | 0.02 | 1.00 | -4.14 | -0.68 | -0.41 | 1.18 | 1.68 | ▁▁▆▇▇ |
| MFR | 0 | 1 | -0.01 | 0.97 | -6.40 | -0.04 | 0.29 | 0.39 | 2.91 | ▁▁▁▇▁ |
| Balling | 0 | 1 | -0.01 | 1.01 | -3.70 | -0.82 | -0.50 | 1.19 | 1.63 | ▁▁▇▂▆ |
| PressureVacuum | 0 | 1 | 0.04 | 1.00 | -2.42 | -0.67 | 0.04 | 0.39 | 2.84 | ▂▇▆▃▂ |
| OxygenFiller | 0 | 1 | 0.09 | 0.98 | -1.98 | -0.27 | 0.08 | 0.69 | 3.21 | ▃▇▇▂▁ |
| BowlSetpoint | 0 | 1 | -0.09 | 1.08 | -2.39 | -1.35 | 0.70 | 0.70 | 2.40 | ▂▃▂▇▁ |
| PressureSetpoint | 0 | 1 | 0.02 | 0.98 | -1.94 | -0.77 | -0.77 | 1.16 | 1.28 | ▁▇▁▁▆ |
| AirPressurer | 0 | 1 | 0.04 | 1.05 | -1.76 | -0.53 | -0.18 | 0.16 | 3.91 | ▂▇▁▁▁ |
| AlchRel | 0 | 1 | 0.00 | 0.99 | -1.08 | -0.72 | -0.67 | 0.68 | 2.74 | ▇▁▂▃▁ |
| CarbRel | 0 | 1 | -0.01 | 1.00 | -3.67 | -0.75 | -0.26 | 0.83 | 3.02 | ▁▂▇▅▁ |
| BallingLvl | 0 | 1 | 0.01 | 1.01 | -2.36 | -0.77 | -0.68 | 1.26 | 1.65 | ▁▆▇▁▇ |
| Name | edata_test_pp |
| Number of rows | 267 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| logical | 1 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| BrandCode | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| PH | 267 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 134.00 | 77.22 | 1.00 | 67.50 | 134.00 | 200.50 | 267.00 | ▇▇▇▇▇ |
| CarbVolume | 0 | 1 | -0.02 | 1.04 | -2.23 | -0.81 | -0.26 | 0.89 | 2.63 | ▂▇▅▅▁ |
| FillOunces | 0 | 1 | -0.04 | 0.86 | -2.60 | -0.62 | -0.09 | 0.45 | 2.60 | ▁▅▇▃▁ |
| PCVolume | 0 | 1 | 0.01 | 1.04 | -3.28 | -0.70 | 0.01 | 0.75 | 2.82 | ▁▃▇▇▁ |
| CarbPressure | 0 | 1 | 0.01 | 1.08 | -2.42 | -0.80 | -0.02 | 0.70 | 2.46 | ▂▆▇▅▂ |
| CarbTemp | 0 | 1 | 0.03 | 1.06 | -3.04 | -0.65 | -0.04 | 0.71 | 2.90 | ▁▃▇▃▁ |
| PSC | 0 | 1 | -0.01 | 1.08 | -2.53 | -0.75 | 0.01 | 0.66 | 2.55 | ▂▅▇▃▂ |
| PSCFill | 0 | 1 | -0.05 | 0.96 | -1.49 | -0.73 | -0.13 | 0.46 | 3.60 | ▆▇▃▁▁ |
| PSCCO2 | 0 | 1 | -0.12 | 0.89 | -1.31 | -0.84 | -0.37 | 0.09 | 4.31 | ▇▃▂▁▁ |
| MnfFlow | 0 | 1 | -0.04 | 0.98 | -1.05 | -1.05 | -0.22 | 0.96 | 1.62 | ▇▁▁▆▂ |
| CarbPressure1 | 0 | 1 | 0.09 | 0.94 | -2.05 | -0.53 | 0.17 | 0.60 | 2.87 | ▂▃▇▂▁ |
| FillPressure | 0 | 1 | 0.06 | 1.10 | -3.82 | -0.58 | 0.01 | 0.76 | 3.37 | ▁▁▇▆▁ |
| HydPressure1 | 0 | 1 | -0.04 | 1.09 | -5.04 | -1.01 | -0.17 | 0.63 | 3.02 | ▁▁▇▆▂ |
| HydPressure2 | 0 | 1 | -0.07 | 1.05 | -4.34 | -1.29 | 0.35 | 0.83 | 2.46 | ▁▁▆▇▁ |
| HydPressure3 | 0 | 1 | -0.07 | 1.04 | -4.42 | -1.29 | 0.44 | 0.77 | 1.79 | ▁▁▆▃▇ |
| HydPressure4 | 0 | 1 | 0.13 | 1.05 | -2.63 | -0.41 | 0.23 | 0.66 | 2.73 | ▁▂▇▃▂ |
| FillerLevel | 0 | 1 | 0.06 | 1.01 | -2.32 | -0.66 | 0.56 | 0.69 | 3.51 | ▃▃▇▁▁ |
| FillerSpeed | 0 | 1 | -0.23 | 1.25 | -3.43 | 0.04 | 0.28 | 0.44 | 0.48 | ▁▁▁▁▇ |
| Temperature | 0 | 1 | 0.21 | 1.20 | -1.74 | -0.40 | -0.09 | 0.53 | 6.09 | ▇▇▂▁▁ |
| Usagecont | 0 | 1 | -0.05 | 1.01 | -2.37 | -1.02 | 0.06 | 0.94 | 1.29 | ▁▅▃▂▇ |
| CarbFlow | 0 | 1 | -0.03 | 1.07 | -2.02 | -1.36 | 0.53 | 0.73 | 1.46 | ▅▁▁▇▃ |
| Density | 0 | 1 | -0.02 | 1.14 | -9.22 | -0.64 | -0.41 | 1.14 | 1.58 | ▁▁▁▃▇ |
| MFR | 0 | 1 | -0.05 | 1.16 | -6.71 | -0.01 | 0.29 | 0.41 | 1.50 | ▁▁▁▁▇ |
| Balling | 0 | 1 | 0.00 | 1.00 | -1.98 | -0.74 | -0.50 | 1.16 | 1.54 | ▁▇▃▁▇ |
| PressureVacuum | 0 | 1 | 0.08 | 1.02 | -2.07 | -0.67 | 0.04 | 0.74 | 2.84 | ▁▇▆▃▁ |
| OxygenFiller | 0 | 1 | 0.02 | 0.99 | -1.98 | -0.41 | 0.05 | 0.55 | 3.20 | ▅▇▇▂▁ |
| BowlSetpoint | 0 | 1 | 0.00 | 0.99 | -2.39 | -0.73 | 0.70 | 0.70 | 1.52 | ▂▂▃▇▁ |
| PressureSetpoint | 0 | 1 | 0.06 | 1.00 | -1.94 | -0.77 | -0.77 | 1.16 | 1.96 | ▁▇▁▆▁ |
| AirPressurer | 0 | 1 | 0.01 | 1.03 | -1.40 | -0.53 | -0.18 | -0.01 | 3.60 | ▅▇▁▁▁ |
| AlchRel | 0 | 1 | 0.01 | 0.98 | -1.08 | -0.72 | -0.62 | 0.68 | 1.74 | ▇▁▁▁▃ |
| CarbRel | 0 | 1 | 0.02 | 0.99 | -2.14 | -0.75 | -0.26 | 0.97 | 2.24 | ▁▇▃▃▃ |
| BallingLvl | 0 | 1 | 0.00 | 1.02 | -2.36 | -0.77 | -0.66 | 1.19 | 1.58 | ▁▃▇▁▇ |
First, we consider a correlation matrix where pairwise complete observations are allowed for use in computing correlations. Due to the low percentage of imcomplete cells in the dataset, we prefer this dropping observations with any incomplete columns.
The correlation matrix below uses hierarchical clustering of the predictors to form 6 groups of variables using the corrplot package. This was the most efficient way to gain insight on the related clusters of variables.
M = cor(sdata_v1[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 )We compare the correlations to the cleaned pre-processed training data.
Mpp = cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 )Our last step of preprocessing is to export the data sets. This need only but done once.
readr::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )
readr::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )
readr::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE )The above data files have the following properties and relationships to the original raw files:
Pre-Processed Data Files
| filename | primary_key | response | numeric_columns | num_rows | file_source |
|---|---|---|---|---|---|
| sdata_train_pp.csv | id - consistent with sdata_test_pp | PH - unchanged | BoxCox, center, scaled, imputed | 2055 | StudentData.xlsx |
| sdata_test_pp.csv | id - consistent with sdata_train_pp | PH - unchanged | BoxCox, center, scaled, imputed | 512 | StudentData.xlsx |
| edata_test_pp.csv | id - standalone | PH - unchanged | BoxCox, center, scaled, imputed | 267 | StudentEvaluation.xlsx |
An important note about the id column. The id in the file edata_test_pp.csv is the row number of the same observation in StudentEvaluation.xlsx. But the same is not true for sdata_train_pp.csv and sdata_test_pp.csv because they are sampled from the source file StudentData.xlsx. The 10th row of sdata_test_pp.csv is not necessarily the 10th row of StudentData.xlsx rather one should use the id field as the corresponding row number of the original file.
The csv files listed above can be loaded and used for model building with no further transformation. Since they contain a primary key and the response variables and predictors, they have the necessary information to fit a model and generate predictions on training, validation and test samples.
We load the csv file for the training data set below and display the first few column characteristics from the file.
sdata_train_pp = read_csv("sdata_train_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% as_tibble()
skim(sdata_train_pp) %>% yank("numeric") %>% head(n=5)Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 1297.57 | 751.06 | 1.00 | 644.50 | 1296.00 | 1961.00 | 2568.00 | ▇▇▇▇▇ |
| PH | 0 | 1 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▁▁ |
| CarbVolume | 0 | 1 | 0.00 | 1.00 | -3.42 | -0.71 | -0.26 | 0.86 | 2.90 | ▁▃▇▅▁ |
| FillOunces | 0 | 1 | 0.00 | 0.99 | -3.88 | -0.62 | -0.01 | 0.60 | 3.91 | ▁▂▇▂▁ |
| PCVolume | 0 | 1 | 0.01 | 1.00 | -3.72 | -0.60 | -0.07 | 0.59 | 3.01 | ▁▂▇▃▁ |
skim(sdata_train_pp) %>% yank("character") ## NULL
Now we load the validation data test (20% of the original training observations) below and inspect its initial columns.
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 1237.47 | 705.94 | 5.00 | 638.00 | 1248.50 | 1803.75 | 2571.00 | ▇▇▇▇▆ |
| PH | 0 | 1 | 8.55 | 0.17 | 7.98 | 8.44 | 8.54 | 8.68 | 8.94 | ▁▃▇▇▃ |
| CarbVolume | 0 | 1 | 0.01 | 1.01 | -3.12 | -0.71 | -0.19 | 0.80 | 2.52 | ▁▃▇▅▂ |
| FillOunces | 0 | 1 | 0.03 | 1.01 | -3.20 | -0.55 | -0.01 | 0.60 | 3.99 | ▁▅▇▂▁ |
| PCVolume | 0 | 1 | -0.01 | 0.93 | -3.47 | -0.55 | -0.04 | 0.58 | 2.79 | ▁▂▇▅▁ |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| BrandCode | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
Now we load the test data test below and inspect its initial columns.
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 134.00 | 77.22 | 1.00 | 67.50 | 134.00 | 200.50 | 267.00 | ▇▇▇▇▇ |
| CarbVolume | 0 | 1 | -0.02 | 1.04 | -2.23 | -0.81 | -0.26 | 0.89 | 2.63 | ▂▇▅▅▁ |
| FillOunces | 0 | 1 | -0.04 | 0.86 | -2.60 | -0.62 | -0.09 | 0.45 | 2.60 | ▁▅▇▃▁ |
| PCVolume | 0 | 1 | 0.01 | 1.04 | -3.28 | -0.70 | 0.01 | 0.75 | 2.82 | ▁▃▇▇▁ |
| CarbPressure | 0 | 1 | 0.01 | 1.08 | -2.42 | -0.80 | -0.02 | 0.70 | 2.46 | ▂▆▇▅▂ |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| BrandCode | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
We demonstrate building a linear regression model using the pre-processed training data set.
We summarize all the R code used in this project in this appendix for ease of reading.
knitr::opts_chunk$set(echo = FALSE)
# Conditional Code Evaluation is managed here.
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(GGally) # for ggpairs
library(readxl) # to parse Excel workbooks
library(openxlsx) # for writing xlsx files
library(corrplot)
library(RColorBrewer)
library(caret)
library(VSURF)
library(tictoc)
tic()
.code-bg {
background-color: lightgray;
}
raw_StudentData <- read_excel("StudentData.xlsx")
dim(raw_StudentData)
skim(raw_StudentData) %>%
yank("numeric") %>%
arrange( complete_rate) %>%
kable( digits = 3 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
skim(raw_StudentData) %>% yank("character") %>%
kable( caption = "Character Variables in Beverage Data" , digits = 3) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
raw_StudentData %>%
rename_with(~ str_replace(., " ", ""), everything() ) %>%
mutate( id = row_number() ) %>%
relocate(id) %>%
relocate(PH, .after = id) %>%
relocate(`BrandCode`, .after = `BallingLvl` ) %>%
filter( !is.na(PH)) %>% as_tibble() -> sdata_v1
dim(sdata_v1)
# include any rows where any column is NA. Always include id column
sdata_v1 %>% dplyr::filter_all( any_vars(is.na(.))) -> x1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
x1 %>% select_if( ~ any(is.na(.) ) ) -> x2
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
x2 %>% mutate( id = x1$id ) %>% relocate(id)-> missing_train_rows
dim(missing_train_rows)
missing_train_rows %>% mutate( num_na = rowSums(is.na(.))) -> missing_train_summary
missing_train_summary %>% group_by(num_na) %>%
summarize( count = n()) %>%
kable( caption = "Number of Rows with NA") %>%
kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
num_na_cells = sum( missing_train_summary$num_na)
num_cells = nrow(sdata_v1) * 32 # excludes the response variable PH since we dropped its values.
print(paste0("There are ", num_na_cells , " cells with NA values out of ", num_cells, " equivalent to: " , sprintf("%.2f%%", 100 * num_na_cells / num_cells ) ) )
raw_StudentEvaluation <- read_excel("StudentEvaluation.xlsx")
dim(raw_StudentEvaluation)
skim(raw_StudentEvaluation) %>%
yank("numeric") %>%
arrange( complete_rate) %>%
kable( digits = 3 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
skim(raw_StudentEvaluation) %>% yank("character") %>%
kable( caption = "Character Variables in Beverage Data" , digits = 3) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
raw_StudentEvaluation %>%
rename_with(~ str_replace(., " ", ""), everything() ) %>%
mutate( id = row_number() ) %>%
relocate(id) %>%
relocate(PH, .after = id) %>%
relocate(`BrandCode`, .after = `BallingLvl` ) %>% as_tibble() -> edata_v1
dim(edata_v1)
# include any rows where any column is NA. Always include id column
#
# NOTE THAT WE EXCLUDE PH because the response variable (by design) is NA
# for all observations in the test set.
edata_v1 %>% select(-PH) %>% dplyr::filter_all( any_vars(is.na(.))) -> ex1
# only include those columns where some row has NA in that column
# this dataset has no id column / so no primary key
ex1 %>% select_if( ~ any(is.na(.) ) ) -> ex2
# But the order of the missing observations is same in x1 and x2
# so add back the id and put it on the left side of tibble
ex2 %>% mutate( id = ex1$id ) %>% relocate(id)-> missing_test_rows
dim(missing_test_rows)
missing_test_rows %>% mutate( num_na = rowSums(is.na(.))) -> missing_test_summary
missing_test_summary %>% group_by(num_na) %>%
summarize( count = n()) %>%
kable( caption = "Number of Rows with NA") %>%
kable_styling(position = "left", bootstrap_options = c("hover", "striped"))
num_na_cells_test = sum( missing_test_summary$num_na)
num_cells_test = nrow(edata_v1) * 32 # excludes the response variable PH since we dropped its values.
print(paste0("There are ", num_na_cells_test , " cells with NA values (excluding PH) out of ", num_cells_test, " equivalent to: " ,
sprintf("%.2f%%", 100 * num_na_cells_test / num_cells_test ) ) )
skim_sdata = skim(sdata_v1)
skim_edata = skim(edata_v1)
skim_sdata %>% select( skim_variable, n_missing, complete_rate) %>% inner_join(skim_edata %>% select(skim_variable, n_missing, complete_rate), by = c("skim_variable") ) %>% filter(skim_variable != 'PH' ) -> skim_complete_rates
skim_complete_rates %>% rename( training_complete = complete_rate.x, testing_complete = complete_rate.y) -> skim_complete_rates
skim_complete_rates %>% as_tibble() %>% ggplot(aes(x=training_complete, y = testing_complete)) +
geom_point() + geom_smooth(method = "lm") + theme(aspect.ratio = 1) + lims( x = c(0.95, 1), y = c(0.95, 1)) +
labs(title = "Completion Rate by Predictor between Training and Test Sets")
# Feature plot for the numeric predictor variables against the result variable PH
cols <- sdata_v1 %>%
select(-c('id', 'BrandCode', 'PH')) %>% colnames()
featurePlot(sdata_v1[,cols],
sdata_v1$PH,
plot="scatter",
type = c("p", "smooth"),
span = .5,
layout=c(6,6))
#https://statisticsglobe.com/histogram-density-for-each-column-data-frame-r
data_long <- sdata_v1[,cols] %>%
pivot_longer(cols) %>%
as.data.frame()
ggp2 <- ggplot(data_long, aes(x = value)) +
geom_density() +
facet_wrap(~ name, scales="free") +
labs(title = "Density Plot of Predictor Variables")
ggp2
# http://www.sthda.com/english/wiki/normality-test-in-r
# From the output, the p-value > 0.05 implying that the distribution of the data are not significantly different from normal distribution. In other words, we can assume the normality.
shap_test_res <- lapply(sdata_v1[,cols], shapiro.test)
# Result: None of the numeric variables are normally distributed.
# https://stackoverflow.com/questions/62306712/how-to-select-only-the-p-value-0-05-after-performing-shapiro-wilk-test-in-rstud
subset_vector <- sapply(shap_test_res, function(x) x$p.value > .05)
results_subset <- shap_test_res[subset_vector]
length(results_subset)
ggplot(data = sdata_v1) +
geom_bar(mapping = aes(x = `BrandCode`)) +
labs(title = "Distribution of Brand Code")
ggplot(data = sdata_v1, mapping = aes(x = `BrandCode`, y = PH)) +
geom_boxplot() +
labs(title = "Boxplot of Brand Code")
sdata_v1_no_nas <- sdata_v1 %>% drop_na()
bev.vsurf <- VSURF(sdata_v1_no_nas[,cols],
sdata_v1_no_nas$PH,
ntree = 10,
nfor.thres = 20,
nfor.interp = 10, nfor.pred = 10)
bev.vsurf
summary(bev.vsurf)
bev.vsurf$varselect.pred
colnames(bev_data_no_nas[,cols])
set.seed(19372)
train_indices = createDataPartition( sdata_v1$PH , times = 1, list = FALSE, p = 0.80 )
sdata_test = sdata_v1[-train_indices , ]
sdata_train = sdata_v1[ train_indices , ]
sdataY_test = sdata_v1$PH[-train_indices ]
sdataY_train = sdata_v1$PH[ train_indices ]
nearZeroVar(sdata_train, saveMetrics = TRUE) %>%
kable(caption = "No Near Zero Variance Predictors in Training", digits = 2 ) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position = "left")
skim(sdata_train) %>%
mutate( zscore_min = ((numeric.p0 - numeric.mean) / (numeric.sd) ) ,
zscore_max = ((numeric.p100 - numeric.mean) / (numeric.sd) ) ,
zscore_extreme = ifelse( abs(zscore_min) > abs(zscore_max) , abs(zscore_min), zscore_max )
) %>%
filter( zscore_extreme > 4 , skim_variable != 'PH') %>%
arrange( desc(zscore_extreme)) %>%
dplyr::select(skim_variable, n_missing, complete_rate, numeric.mean, numeric.sd, numeric.hist, zscore_min, zscore_max ) %>%
kable(digits = 2, caption = "Predictors with Outliers by Z-Score" ) %>%
kable_styling(bootstrap_options = c("hover" , "striped"), position = "left")
pl1 = sdata_train %>% ggplot(aes(x=PH, y = MFR)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl2 = sdata_train %>% ggplot(aes(x=PH, y = OxygenFiller)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl3 = sdata_train %>% ggplot(aes(x=PH, y = Temperature)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl4 = sdata_train %>% ggplot(aes(x=PH, y = CarbRel)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl5 = sdata_train %>% ggplot(aes(x=PH, y = AirPressurer)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl6 = sdata_train %>% ggplot(aes(x=PH, y = PSCCO2)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
pl7 = sdata_train %>% ggplot(aes(x=PH, y = FillPressure)) + geom_point(size=0.3) + theme(aspect.ratio = 1)
plotlist= list(pl1, pl2, pl3, pl4, pl5, pl6, pl7)
plot_grid(plotlist=plotlist, nrow = 2 )
median_MFR = median( sdata_train$MFR , na.rm = TRUE )
sdata_train %>%
mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) -> sdata_v2_train
sdata_test %>%
mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) -> sdata_v2_test
edata_v1 %>%
mutate( MFR = ifelse( is.na(MFR), median_MFR, MFR)) %>%
mutate( BrandCode = ifelse( is.na(BrandCode), "E", BrandCode)) -> edata_v2
# Build the caret function to preprocess the Chemical data and impute missing values.
# There is a bug in caret which causes tibbles to be rejected. They need to be cast as data frames.
# ---------------------------------------------------------------------------------
preProcFunc = preProcess(as.data.frame(sdata_v2_train[,3:33]) , method = c("BoxCox", "center", "scale", "knnImpute") )
# Becomes the source data for the model building
sdata_train_pp = predict( preProcFunc, as.data.frame(sdata_v2_train ) )
# Becomes the final version of test data for validation
sdata_test_pp = predict( preProcFunc, as.data.frame(sdata_v2_test) )
# Need to generate predictions based on test data without known response
edata_test_pp = predict( preProcFunc, as.data.frame(edata_v2) )
sdata_trainX_pp = sdata_train_pp %>% select( -PH)
sdata_testX_pp = sdata_test_pp %>% select(-PH)
edata_testX_pp = edata_test_pp %>% select( -PH)
skim(sdata_train_pp)
skim(sdata_test_pp)
skim(edata_test_pp)
M = cor(sdata_v1[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 )
Mpp = cor(sdata_train_pp[,2:33], use = "pairwise.complete.obs")
corrplot::corrplot(M, method = "ellipse", order = "hclust", addrect = 6 , tl.cex = 0.7 )
readr::write_csv(sdata_train_pp, "sdata_train_pp.csv", append = FALSE )
readr::write_csv(sdata_test_pp, "sdata_test_pp.csv", append = FALSE )
readr::write_csv(edata_test_pp, "edata_test_pp.csv", append = FALSE )
file_df = tibble( filename = c("sdata_train_pp.csv", "sdata_test_pp.csv", "edata_test_pp.csv"),
primary_key = c("id - consistent with sdata_test_pp", "id - consistent with sdata_train_pp", "id - standalone") ,
response = c("PH - unchanged", "PH - unchanged", "PH - unchanged" ) ,
numeric_columns = c("BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed","BoxCox, center, scaled, imputed") ,
num_rows = c(2055, 512, 267 ) ,
file_source = c("StudentData.xlsx", "StudentData.xlsx", "StudentEvaluation.xlsx")
)
file_df %>% kable(caption = "Pre-Processed Data Files") %>% kable_styling(bootstrap_options = c("hover", "striped"
),
position = "left")
sdata_train_pp = read_csv("sdata_train_pp.csv", col_types = cols( .default = "?", id = "i" , BrandCode = "f")) %>% as_tibble()
skim(sdata_train_pp) %>% yank("numeric") %>% head(n=5)
skim(sdata_train_pp) %>% yank("character")
sdata_test_pp = read_csv("sdata_test_pp.csv", col_types = cols( .default = "?", id = "i" )) %>% as_tibble()
skim(sdata_test_pp) %>% yank("numeric") %>% head(n=5)
skim(sdata_test_pp) %>% yank("character")
edata_test_pp = read_csv("edata_test_pp.csv", col_types = cols( .default = "?", id = "i" )) %>% as_tibble()
skim(edata_test_pp) %>% yank("numeric") %>% head(n=5)
skim(sdata_test_pp) %>% yank("character")
toc()## 48.625 sec elapsed